PROJECT DESCRIPTION

We have the following data from a beauty e-commerce in Russia:

We perform an exploratory data analysis, including a RFM analysis. After that, we perform a recommender system based on collaborative filtering.

ANALYTIC DATAMART

SET UP

LOAD DATA

DATA INTEGRATION

DATA QUALITY

Variable types

Delete the index column and convert event_time to a datetime format.

Null values analysis

Conclusions:

We are going to:

Analysis of numeric variables

We see negatives in the price. We are going to analyze it.

There are about 25000 records. Do the belong to certain products?

It does not seem to be a problem with a specific product, so we will delete all records.

Analysis of categorical variables

Set date as index

DATA TRANSFORMATION

We are going to create 3 types of new variables

Componentes of the date

Calendar variables: Local holidays

Since the e-commerce is set on Russia, wioth the leverage of the holidays packet, we import the local holidays from Russia.

We are going to include a variable that tells in each record whether it was a holiday or not.

Check

Other indicators

We will add indicators for Black Friday and Valentine's Day.

Check

ANALYSIS AND INSIGHTS

Events Anaylisis

Customer Journey

We analyze the possible events:

Conclusions:

How many products are viewed, added to cart, discarded and purchased on average per session?

Create df to a session and event level.

We move the events to columns.

We compute the average of each event per session.

Conclusion:

In each session, on average:

Are there any differences among events depending on the hour?

Create df to a event-hour level.

We move the events to the columns.

Visualize how the events are distributed per hour.

There is a global pattern as expected.

In order to explore the differences we can create a new variable that is the ratio of purchases per visit in each hour.

Visualize to see if there are hours when proportionally more is purchased.

Conclusions:

What is the average monthly billing?

What is the trend of events in the past months?

There is a significant peak in the week of the 24th, obviously because of black friday, we are going to do the same analysis but daily and only for November and December to see the effect.

Conclusions:

Day-time with highest number of purchases

We try to identify moments at the day-hour level when the highest number of purchases occur. It would be very useful to concentrate much of the campaign investment precisely at those times.

Analysis of customers

Create df with the variables of interest per customer: total number of products, total number of purchases, average proce per product and date of last purchase.

Rename variables.

Calculate additional variables: the customer's total spend, and the products per purchase.

For each of the customers we have:

How are customers distributed in terms of spending?

The vast majority of customers have spent less than €50.

How are customers distributed in terms of number of purchases?

The vast majority of customers make only one purchase.

How many products does a customer buy on average per purchase?

The maximum is quite high, probably from a customer who resells the products. To get rid of the outlier we analize the median instead of the mean.

The median purchase includes 5 products.

But 25% of customers buy more than 10 products in the same purchase. There is a lot of room for improvement in this ratio, for example, through recommender systems at the time of purchase

Which customers have generated the most revenue?

We calculate the best 10 customers with the average expenditure per customer.

Calculate the average total expenditure per customer.

There are customers with average spending dozens of times higher than the average.

These customers need to be retained through loyalty programs.

Survival analysis

Since we only have 5 months of history we will create a 3-month cohort analysis, which gives us 3 cohorts.

Create df just with the users who purchased a product and the month.

Move the months to the columns.

Delete user_id, we don't need it anymore. The period starts in october, so the month 10 is the cohort 1.

The first cohort will be that of month 2. We select "new" customers (those who were not in the previous month).

We convert to a binary dataframe ( we only care if that customer has purchased or not in each month, not the number of purchases)

Compure the percentage of customers in this cohort who have continued to buy in the following months.

Cohort 3

Cohort 4

Cretae df with all cohorts.

Conclusion:

90% of new customers do not buy again in the following months.

RFM matrix: On which clients to execute the next campaigns ?

We bulid the RFM matrix (Recency - Frequency - Monetary)

The variables for each dimension are:

We discretize the variables and create 5 groups.

Check

Create monetary

Create recency. It should be a number not a date. Therefore, we define it as the distance in days from the most recent dae.

We can now create the R, but keep in mind that in this case the lowest values are best, so it must be inverted to be consistent with the M and F.

Create additional variables that unifies the R, F and M:

On this dataframe we create a cube to make the analysis.

We analize each dimension of the cube and visualize it.

Average number of products, total number of purchases and total expediture made by each customer's group.

The result identifies the best customers to target when launching loyalty campaigns.

Analysis of products

Create df in a product level with the events and the price of the product.

We include the average price of each product, since the price might have changed over the time.

We integrate both dataframes.

What are the best-selling products?

Are there products that are not selling and could we eliminate from the catalog?

Conclusion:

Almost half of the products have not had any sales in the 5 months of the data history.

What is the relationship between price and sales volume?

Since this analysis includes sales, we will eliminate products that have not had any sales.

There is a clear decreasing relationship.

What are the most viewed products?

Are there products viewed but not purchased?

Remove the outlier and zoom into the many views few purchases window.

Recommender system

Uno de los activos que más pueden incrementar las ventas de un ecommerce es un sistema de recomendación.

Ya podríamos aplicar uno básico con los análisis de más visto y más vendido realizados anteriormente.

Pero la verdadera potencia viene cuando creamos un recomendador que personaliza para cada compra.

Tipos de sistemas de recomendación:

En nuestro caso vamos a desarrollar uno con filtrado colaborativo basado en items.

Los pasos a seguir son:

  1. Crear el dataframe con el kpi de interés
  2. Reducir la dimensión (opcional)
  3. Seleccionar una métrica de distancia
  4. Calcular la matriz item-item
  5. Crear la lógica de priorización

Create the dataframe with the kpi of interest

The KPI is the number of times the products have been purchased by the same user.

We will build the matrix user-product.

Due to PC constrains, we biuld the recommender system just for the 100 best-selling products.

Creamos la matriz usuario-item.

We create the user-product matrix:

Regarding to the distance metric, we use the euclidean distance (function spatial.distance.euclidean from Scipy)

Computethe matrix product-product

Recommender function

The following function queries the above table each time a user looks at a product or puts it in the cart, and finds the N_products with the shortest distance.

Check with one single product and multiple products.